Highly Accurate Distributed Classification of Web Documents
Authors
Abstract
With the rapid growth of the internet, finding an efficient and accurate text classifier that can handle the enormous volume of online documents is both a scientific challenge and a pressing economic need. This paper presents a distributed model for efficient web document classification. In the model, the distributed text classifiers are trained serially, with the weights on the training instances set adaptively according to previous performance. Based on the distributed model, Unequal Bagging (UBagging), an improved bagging technique for text classifiers, is also proposed. Experimental results show that our approach achieves higher classification accuracy than traditional centralized text classifiers while requiring less memory and computation time.
Similar resources
An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Learning Distributed Document Representations for Multi-Label Document Categorization
Multi-label Document Categorization, the task of automatically assigning a text document into one or more categories has various real-world applications such as categorizing news articles, tagging Web pages, maintaining medical patient records and organizing digital libraries among many others. Statistical Machine Learning approaches to document categorization have focused on multi-label learni...
Document clustering based on keyword frequency and concept matching technique in Hadoop
The term big data has come into use in recent years. It is used to refer to the ever-increasing amount of information that organisations are storing, processing and analysing. Owing to the growing number of information sources, big data based file systems are necessary. Due to the explosive growth of digital information, automatic document clustering or categorization has become more important...